People like me working in the field of Natural Language Processing almost always come across the task of replacing words in a text. The reasons behind replacing the words may be different. Some of them are.
Now, if the number of words to replace and the corpus of text is not huge i.e within thousands, then Regular Expressions have always been my solution. But as I started working on bigger and bigger datasets with tens of thousands of documents and sometimes millions, I noticed that performing the above tasks started taking days. In today’s fast-moving world, this is not the amount of time one would want to invest in a very simple but important task. So earlier, it would come down to optimizing the number of words necessary to be changed and time required to replace these words.
But in the early November, I found FlashText – a super blazingly fast library that reduces days of replacement computation time to minutes.
FlashText is a Python library created specifically for the purpose of searching and replacing words in a document. Now, the way FlashText works is that it requires a word or a list of words and a string. The words which FlashText calls keywords are then searched or replaced in the string.
Let us check out in detail about FlashText working. When keywords are passed to FlashText for searching or replacing, they are stored as a Trie Data Structure which is very efficient at Retrieval tasks. Below is an example of how a Trie Data Structure looks like.
Above is a Trie Data Structure for the words (their, there, answer, any, and bye).
Now, in case of searching the keywords, FlashText will return the keywords that are present in the string. In case of replacing, FlashText will create a new string with the keywords replaced. Both these operations happen over a single pass. Now it is important to understand the concept of the single pass.
An example of single pass replacement looks like this-
String = “spamham sha”
Replace “spam” with “eggs” and “sha” with “md5”
Now let’s see how does the String look like with and without a single pass.
Single-pass
String = “eggsham md5”
Without Single-pass
String = “eggmd5m md5”
Above, you can see the difference between the single pass and without the single pass.
Now, we have an overview of how FlashText works. It is imperative that we understand – What is that FlashText has which Regular Expressions don’t? After all, Regular Expressions are mostly considered the one-stop solution for string manipulation in terms of both speed and variety of manipulations that can be done. This can be better understood with the help of an example by the author of the FlashText package himself.
Suppose, there is a string called sample = ” This is a sample sentence” and a collection of words called keywords = {sample, sameer, pony, time}. Now, if we are to perform searching of the keywords in the sample, there are 2 ways of doing it.
Method 1:
for each word in keywords: Line 1
check if word exists in sample Line 2
Now the above method has a loop which will run n times, where n is the number of words in the keywords. Also, there will be significant time consumption in Line 2 which checks whether a particular word is present in a string or not.
Method 2:
for each word in sample: Line 1
check if word exists in the keywords dictionary Line 2
Now, the loop in this method will run m times, where m is the number of words in the sample. The major advantage is the execution time of Line 2. Checking a key in a dictionary is a significantly faster process than checking for a word in a string.
FlashText uses Method 2 for faster searching and replacing and is inspired by the Aho-Corasick Algorithm and Trie Data Structure. Let’s have a look at the way it works:
First, a Trie dictionary is created from the keywords which looks like below.
Start and EOT represent the word boundaries such as space, new_line etc. The idea is that words which have word boundaries on both ends should only match. So, pam will not match to pamella. Now let’s see how we can perform searching,
Since we have our keywords dictionary with us. We pass in the input string sample. Now, the comparison will look like below.
Is <start>This<EOT>in the keywords dictionary? No
Is <start>is<EOT>in the keywords dictionary? No
Is <start>a<EOT>in the keywords dictionary? No
Is <start>sample<EOT>in the keywords dictionary? Yes
Is <start>sentence<EOT>in the keywords dictionary? No
Visually it looks like below.
An important thing to keep in mind is that checking for a word in the dictionary happens at a character level. So for example, while checking for the word “is”, it will search for <start>”i”. Since “i” does not exist anywhere after <start> in the Keywords trie dictionary, therefore the entire word is skipped. This character level matching and skipping are what gives FlashText its speed.
Pretty much every time since it is super fast. Below is the image of the benchmarking done by FlashText for searching and replacing against regular expressions.
Searching
Looking at the image, it looks like when the number of keywords to be searched is below 500 then, regular expressions provide a little edge over FlashText. But as soon as number of keywords cross 500, FlashText surpasses regular expression performance by a wide margin.
Conclusion: Use FlashText when you have to search for a large number of keywords, preferably more than 500.
Replacing
While Regular Expressions presented stiff competition to FlashText in the sub 500 keyword domain for searching, when it comes to replacing FlashText beats Regular Expressions hands down. While there is a linear increase in time as the number of keywords increase, FlashText stays constant.
Conclusion: FlashText is any day better than regular expressions for replacing keywords in a document.
These are the codes for benchmarking of searching and replacing by the author.
There are still a few caveats of using FlashText. As of now, FlashText does not support partial word matching or special characters matching. For that, Regular Expressions are the best.
Installing FlashText is as easy as any other package. In command prompt/ terminal, type:
pip install flashtext
Below is the code to find keywords in a document.
# Replacing all occurences of word 'batman'(case insensitive) with Bruce Wayne
processor = KeywordProcessor(case_sensitive = False)
processor.add_keywords('batman','Bruce Wayne')
found = processor.replace_keywords(document)
print(found)
output:
‘Bruce Wayne is a fictional superhero appearing in American comic books published by DC Comics. The character was created by artist Bob Kane and writer Bill Finger,[4][5] and first appeared in Detective Comics #27 (1939). Originally named the “Bat-Man”, the character is also referred to by such epithets as the Caped Crusader, the Dark Knight, and the World\’s Greatest Detective.[6]\n\nBruce Wayne\’s secret identity is Bruce Wayne, a wealthy American playboy, philanthropist, and owner of Wayne Enterprises. After witnessing the murder of his parents Dr. Thomas Wayne and Martha Wayne as a child, he swore vengeance against criminals, an oath tempered by a sense of justice. Bruce Wayne trains himself physically and intellectually and crafts a bat-inspired persona to fight crime.[7]\n\nBruce Wayne operates in the fictional Gotham City with assistance from various supporting characters, including his butler Alfred, police commissioner Gordon, and vigilante allies such as Robin. Unlike most superheroes, Bruce Wayne does not possess any superpowers; rather, he relies on his genius intellect, physical prowess, martial arts abilities, detective skills, science and technology, vast wealth, intimidation, and indomitable will. A large assortment of villains make up Bruce Wayne\’s rogues gallery, including his archenemy, the Joker.\n\nThe character became popular soon after his introduction in 1939 and gained his own comic book title, Bruce Wayne, the following year. As the decades went on, differing interpretations of the character emerged. The late 1960s Bruce Wayne television series used a camp aesthetic, which continued to be associated with the character for years after the show ended. Various creators worked to return the character to his dark roots, culminating in 1986 with The Dark Knight Returns by Frank Miller. The success of Warner Bros.\’ live-action Bruce Wayne feature films have helped maintain the character\’s prominence in mainstream culture.[8]\n\nAn American cultural icon, Bruce Wayne has garnered enormous popularity and is among the most identifiable comic book characters. Bruce Wayne has been licensed and adapted into a variety of media, from radio to television and film, and appears on various merchandise sold around the world, such as toys and video games. The character has also intrigued psychiatrists, with many trying to understand his psyche. In 2015, FanSided ranked Bruce Wayne as number one on their list of “50 Greatest Super Heroes In Comic Book History”.[9] Kevin Conroy, Bruce Greenwood, Peter Weller, Anthony Ruivivar, Jason O\’Mara, and Will Arnett, among others, have provided the character\’s voice for animated adaptations. Bruce Wayne has been depicted in both film and television by Lewis Wilson, Robert Lowery, Adam West, Michael Keaton, Val Kilmer, George Clooney, Christian Bale, an’
So this was all about FlashText – an efficient library for searching and replacing of keywords in millions of document. If you are into the NLP field and it is your day to day job of dealing with this kind of problem of text cleaning and modification then, I would really suggest you try the library once. The source for the library is present here for any reference.
If you try out this amazing library, do let me know in the comments below!
Hello Neeraj, Very nice piece of information. It is definitely worth a try and the best part of this article was the benchmarking(something on which other articles miss out) against other trivial approaches.
This is just amazing. I am working on the same task of replacing words and finding it difficult, developed rules to achieve the task ( i am using levenshtein distance, association from tm, some custom rules) . But, still the performance is not as desired. Will try this one too. just wondering, ,Does this package exist on R too?
Hi NSS, doubt: i installed pip install flashtext on my system command prompt, it is impoting there on my cmd prompt but still i cannot access this in anaconda (jupyter or spyder). so what to do? do i need to run conda command after running pip command always?
It seems like either you are using different python version in command prompt and jupyter or you are running jupyter in a different environment. To check, please copy the below command to both inside python shell in command prompt and jupyter notebook. Output of the commands should match. import sys sys.executable